docs: add doc to recover from pod from lost node #8742
Conversation
### Shorten the timeout

To shorten the timeout, you can mark the node as "blacklisted" so Rook can safely failover the pod sooner.
In case the node is just offline and there is no watcher active, we need to just blacklist the whole node, rather than blacklist just a session id, right? Seems like we could simplify this section to just blacklist the node ip.
Response? I don't understand why we would want to blacklist only a session id instead of always blacklisting the whole node. The point is also to prevent the node from coming back online and creating a new session, right?
@Madhu-1 can you help here? Thanks
We need to blacklist the IP as we want to block all sessions of that node.
Ok, so if we want to blacklist all sessions, we only need the node ip, right? And no need to get the PV session IDs?
Yes we just need the Node IP to blacklist it.
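A minimal sketch of that flow, assuming a hypothetical lost node named `lost-node`: read the node's `InternalIP` from Kubernetes, then blacklist it from the [Rook toolbox](ceph-toolbox.md). (On Ceph Pacific and newer the subcommand is named `blocklist` rather than `blacklist`.)

```console
$ NODE_IP=$(kubectl get node lost-node -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')
$ ceph osd blacklist add $NODE_IP
```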
Force-pushed from 9bcd369 to 2b96b94
Start the sentence with a capital letter in your commit message.
Force-pushed from 2b96b94 to 514a2ba
Force-pushed from 514a2ba to 2bf5381
Force-pushed from 2bf5381 to 123fe41
Force-pushed from 123fe41 to e6331b4
```console
$ PV_NAME= # enter pv name
$ IMAGE=$(kubectl get pv $PV_NAME -o jsonpath='{.spec.csi.volumeHandle}' | cut -d '-' -f 6- | awk '{print "csi-vol-"$1}')
$ echo $IMAGE
```

The solution is to remove the watcher, following the commands below from the [Rook toolbox](ceph-toolbox.md):

```console
$ rbd status <image> --pool=<pool name> # get image from above output
```
> ```
> Watchers:
>     watcher=10.130.2.1:0/2076971174 client.14206 cookie=18446462598732840961
> ```

```console
$ ceph osd blacklist add 10.130.2.1:0 # to know which watcher to block see above output
blacklisting 10.130.2.1:0
```
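The `cut`/`awk` pipeline above turns a CSI volumeHandle into an RBD image name. As a self-contained sanity check of just that parsing step (the handle and its UUID below are made up for illustration, not from a real cluster):

```shell
# Made-up volumeHandle; the clusterID part ("rook-ceph") itself contains a dash,
# which is why the image UUID starts at field 6 when splitting on '-'
HANDLE="0001-0009-rook-ceph-0000000000000001-89c94f98-1234-11ec-aabb-0242ac110002"
# Keep fields 6 and onward, then prefix with "csi-vol-" to get the image name
IMAGE=$(echo "$HANDLE" | cut -d '-' -f 6- | awk '{print "csi-vol-"$1}')
echo "$IMAGE"
```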
@travisn before I make changes, just to confirm: I'll remove the above part and will just ask the user to get the node IP (of the node which is lost) and blacklist that.
And if the Ceph version is above Octopus we'll use `ceph osd blacklist`, else `ceph osd blocklist`.
If the Ceph version is Pacific and above, use `blocklist`; else use `blacklist`.
right, thanks
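A hedged sketch of picking the command name from the Ceph version string (the sample output line is illustrative; on a live cluster you would set `VERSION_STR=$(ceph version)` instead):

```shell
# Illustrative `ceph version` output; Pacific is major version 16
VERSION_STR="ceph version 16.2.6 (abc123) pacific (stable)"
# The third whitespace-separated field is the numeric version; take the major part
MAJOR=$(echo "$VERSION_STR" | awk '{print $3}' | cut -d '.' -f 1)
if [ "$MAJOR" -ge 16 ]; then
  BLOCK_CMD="ceph osd blocklist add"   # Pacific (v16) and newer
else
  BLOCK_CMD="ceph osd blacklist add"   # Octopus (v15) and older
fi
echo "$BLOCK_CMD"
```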
Correct, we're just blocking the node, rather than a session id.
Force-pushed from e6331b4 to 815b31a
This pull request has merge conflicts that must be resolved before it can be merged. @subhamkrai please rebase it. https://rook.io/docs/rook/latest/development-flow.html#updating-your-fork
Force-pushed from 815b31a to 29b1391
@travisn ^^^
Force-pushed from 29b1391 to f089198
Force-pushed from f089198 to 4e1906f
This commit adds the doc which has the manual steps to recover from the specific scenario like `on the node lost, the new pod can't mount the same volume`.

Closes: rook#1507
Signed-off-by: subhamkrai <srai@redhat.com>
Force-pushed from 4e1906f to 7587704
docs: add doc to recover from pod from lost node (backport #8742)

This commit adds the doc which has the manual steps to recover from the specific scenario like `on the node lost, the new pod can't mount the same volume`.

Closes: #1507
Signed-off-by: subhamkrai <srai@redhat.com>
Description of your changes:

Which issue is resolved by this Pull Request:

Resolves #1507

Checklist:
- [ ] Code generation (`make codegen`) has been run to update object specifications, if necessary.